Add fix for prerenderer cache-clearing on publish #4719

Merged
backspace merged 33 commits into main from published-caching-cs-11043
May 14, 2026

Conversation

@backspace
Contributor

@backspace backspace commented May 7, 2026

Publishing a realm has been inconsistent in whether the changes actually show up, which has indeed been about caching, but across multiple layers. I’m out of my depth here, but here’s Claude’s explanation (edited after PR feedback about unnecessary work):

Bug

After republish, the published URL kept serving pre-republish HTML. In production:
nyuitp2026.boxel.site served the old wordmark for ~37h after the user pushed the new img form.

Root cause

The Realm class holds an in-memory #sourceCache keyed by path. After a republish:

  1. Publish handler does the FS swap and enqueues a reindex.
  2. The index worker fetches source over HTTP into the realm-server.
  3. getSourceOrRedirect returns pre-swap bytes from #sourceCache (plus a stale content-hash etag).
  4. The worker writes stale HTML into boxel_index.isolated_html.

The design assumed the NodeAdapter file-watcher would invalidate #sourceCache, but
ENABLE_FILE_WATCHER is unset in staging/production, so it never fires.

Fix

One commit. New Realm.clearLocalCaches() method; handle-publish-realm calls it after the FS swap,
before the reindex enqueue. Makes the invalidation synchronous — no race, no file-watcher
dependency.
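For orientation, a minimal sketch of the call order described above (function names are taken from the commit messages further down in this PR; exact signatures and the surrounding handler code are assumptions):

  // Inside the handle-publish-realm handler, after the FS swap (sketch only).
  await upsertPublishedRealmInRegistry(publishedRealmURL); // registry bookkeeping
  let realm = await lookupOrMount(publishedRealmURL);      // mount the published realm
  realm.clearLocalCaches();                                // drop #sourceCache + module cache synchronously
  await enqueueReindexRealmJob(publishedRealmURL);         // reindex now reads post-swap bytes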

@github-actions
Contributor

github-actions Bot commented May 7, 2026

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   1h 54m 0s ⏱️ - 9m 6s
2 661 tests ±0  2 646 ✅ ±0  15 💤 ±0  0 ❌ ±0 
2 680 runs  ±0  2 665 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit ceeed01. ± Comparison against earlier commit df9d319.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   11m 6s ⏱️ +8s
1 367 tests  - 1  1 367 ✅ ±0  0 💤 ±0  0 ❌  - 1 
1 446 runs   - 1  1 446 ✅ ±0  0 💤 ±0  0 ❌  - 1 

Results for commit ceeed01. ± Comparison against earlier commit df9d319.

backspace and others added 3 commits May 11, 2026 09:29
Covers the gap that let CS-11043 ship: every existing publish-realm
test does a single publish, so a republish that signals success but
serves stale content has no regression net. The new test defines a
sentinel card, publishes, asserts the initial sentinel renders on
the published URL, edits the source, republishes, and asserts the
updated sentinel renders (and the initial one is gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
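A condensed sketch of the test shape this commit describes, in Playwright (publishRealm, postCardSource, and cardWithTitle are hypothetical helpers standing in for the real test utilities; the real spec opens the published URL in a popup):

  import { test, expect } from '@playwright/test';

  test('republished realm serves updated content', async ({ page, context }) => {
    const initial = `sentinel-initial-${Date.now()}`; // stand-in for a uuid
    const updated = `sentinel-updated-${Date.now()}`;

    await postCardSource(page, 'sentinel-card.gts', cardWithTitle(initial));
    const publishedRealmURL = await publishRealm(page); // first publish

    let published = await context.newPage();
    await published.goto(publishedRealmURL);
    await expect(published.getByText(initial)).toBeVisible(); // initial sentinel renders

    await postCardSource(page, 'sentinel-card.gts', cardWithTitle(updated)); // edit source
    await publishRealm(page); // republish

    await published.reload();
    await expect(published.getByText(updated)).toBeVisible(); // updated sentinel renders
    await expect(published.getByText(initial)).toHaveCount(0); // initial one is gone
  });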
backspace and others added 4 commits May 12, 2026 16:23
The host-side Loader-reset approach (sticky-for-batch + the
two-flavor clearCache/resetLoaderOnly refinement) was causing
widespread CI fallout — query-field server-hydration tests,
live-update tests, file-tree navigation, code-submode preview,
multi-reindex timeouts, host memory baseline. The Loader reset
also doesn't directly address the actual root cause investigated
under CS-11043: the stale module bytes the published nyuitp2026
realm served for ~37h were sitting in Chromium's process-level
HTTP cache, not the host's in-process Loader cache.

Roll back to pre-fix state. A follow-up commit on this branch
targets the Chromium-cache layer directly via Cache-Control on
the realm-server's source/module responses.

Kept on the branch (independently useful regardless of fix
mechanism):
  - packages/matrix/tests/publish-realm.spec.ts republish test
    (the regression net that fills the gap which let CS-11043
    ship in the first place).
  - infra/ Checkly canary script + Terraform + provisioning
    runbook on the checkly-publish-cs-11096 branch in the infra
    repo (production monitoring).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…11043)

The publish-republish failure mode investigated under CS-11043
was Chromium's process-level HTTP cache holding stale module
bytes (presentation.gts and friends) across publishes — verified
by inspecting the staging nyuitp2026 realm's failed render.

Source responses were sent with no `Cache-Control` and no
`Last-Modified` header (verified via `curl -sI`), which lets
Chromium apply heuristic caching. That cache lives at the
browser-process level, shared across all puppeteer pages on a
prerender-server task. After a republish, even freshly-spawned
puppeteer pages could pull old module bytes from the persistent
Chromium cache for ~37h, until the prerender-server task itself
rotated.

`Cache-Control: no-store` evicts the heuristic-cache vector
entirely. Every source/module fetch goes back to the realm-server,
which serves whatever bytes are current on EFS. Cost: no browser
cache reuse for unchanged source files — acceptable because card
content is prerendered into boxel_index.isolated_html by the
indexer and not typically re-fetched per page view anyway.

Applied at the single place every source response passes through
(`getSourceOrRedirect`'s defaultHeaders); cached-redirect entries
go through a separate header map that doesn't get the new value,
which is fine — 302 redirects are not the stale-bytes vector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backspace and others added 8 commits May 12, 2026 20:06
…(CS-11043)

The Cache-Control: no-store change targets Chromium's HTTP cache,
but the publish-republish staleness investigated under CS-11043
also lived a layer up: the prerender server's puppeteer pages
hold a host-app `Loader` that caches evaluated modules by URL.
After a republish swaps new bytes onto disk, that Loader would
keep handing back the OLD module on subsequent renders — the
HTTP layer's no-store header doesn't reach into the host's
module cache.

In production this manifested as nyuitp2026.boxel.site rendering
the wordmark form of presentation.gts for ~37h after publishing
the img form. The matrix republish test (added on this branch)
reproduces the same failure in the test env: the second publish's
reindex runs on the warm prerender, the Loader serves the
initial sentinel-card module, and the published URL serves the
initial sentinel even though disk has the updated source.

Fix: after the FS swap completes (and alongside the existing
DELETE FROM modules DB-cache clear), call
prerenderer.disposeAffinity({ affinityType: 'realm', affinityValue:
publishedRealmURL }). This tears down the puppeteer pages for
the affinity; the next render against the realm spawns a fresh
page that fetches modules from disk via the realm-server.

Made `disposeAffinity` optional on the Prerenderer interface
(matching the `releaseBatch?` pattern) so stub / remote
implementations aren't forced to provide it. The call is
best-effort: a thrown error is logged via log.warn but doesn't
fail the publish, since the page-pool's LRU rotation cleans up
eventually — we just want to avoid the long staleness window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(CS-11043)

The previous commit added a publish-handler call to
`prerenderer.disposeAffinity(...)`, but in any deployment that
routes prerender requests over HTTP (which is what every real
deployment uses, and what the matrix test infra uses via
isolated-realm-server), the realm-server holds a
`RemotePrerenderer` — which didn't implement `disposeAffinity`,
so the call silently no-op'd via `prerenderer?.disposeAffinity`.

Wire the call through the same plumbing the existing
`releaseBatch` path uses:

  - RemotePrerenderer.disposeAffinity() — POST /dispose-affinity
    on the prerenderURL (manager or single prerender server).
    Best-effort fetch with an abort timer so a stuck upstream
    can't block the publish handler.

  - prerender-app POST /dispose-affinity — accepts JSON:API body
    with {affinityType, affinityValue}, calls into the concrete
    Prerenderer's existing disposeAffinity() method (which clears
    the auth cache and tears down all puppeteer pages for the
    affinity).

  - manager-app POST /dispose-affinity — fans the request out to
    every server currently assigned the affinity, mirroring how
    release-batch broadcasts. Each server runs its own local
    disposal; the broadcast resolves when all targets do.

After this, the matrix republish test's second publish triggers
disposeAffinity end-to-end: realm-server publish handler →
RemotePrerenderer HTTP POST → manager fan-out → prerender server
disposes pages. The next render against the realm spawns a fresh
puppeteer page that fetches sentinel-card.gts from disk and sees
the updated sentinel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
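A rough sketch of the RemotePrerenderer client side as described in this commit (the /dispose-affinity path and the {affinityType, affinityValue} body are from the commit message; the JSON:API wrapping, timeout value, and class shape are assumptions, and this wiring is removed again later in the PR):

  class RemotePrerenderer {
    constructor(private prerenderURL: string) {}

    // Best-effort: a stuck upstream can't block the publish handler.
    async disposeAffinity(opts: { affinityType: string; affinityValue: string }) {
      let controller = new AbortController();
      let timer = setTimeout(() => controller.abort(), 5_000); // timeout value is an assumption
      try {
        await fetch(`${this.prerenderURL}/dispose-affinity`, {
          method: 'POST',
          headers: { 'content-type': 'application/vnd.api+json' },
          body: JSON.stringify({ data: { attributes: opts } }),
          signal: controller.signal,
        });
      } catch {
        // tolerated; the page-pool's LRU rotation cleans up eventually
      } finally {
        clearTimeout(timer);
      }
    }
  }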
The test keeps failing with the same shape — popup shows the
initial sentinel after the second publish — but we can't tell
which link in the chain is broken without ground-truth signals.
This commit adds checkpoint logging and direct-fetch
diagnostics so a CI failure tells us:

  - whether the source-content POST actually landed (compare
    GET'd source body against the updated sentinel value),
  - whether the default-domain checkbox state needed re-clicking
    after modal close+reopen,
  - whether the publish button was actually enabled at click
    time,
  - whether the /_publish-realm response was observed (and its
    status code if so),
  - what a plain HTTP fetch of the published URL returns vs. what
    the popup renders (decouples server-side staleness from
    browser-cache).

Also adds the previously-missing default-domain-checkbox click
on the second-publish path. The checkbox can lose its
selection on modal close, leaving `isPublishDisabled` true and
the publish click a silent no-op. The new `isChecked()` /
`isDisabled()` checks make that visible, and the conditional
re-click avoids the failure mode.

Strictly diagnostic + bugfix. No assertion changes; if the
test still fails, the console.log lines pinpoint which of the
chain links is the actual broken one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
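The checkbox guard described above looks roughly like this (the selectors are placeholders, not the real test ids):

  let checkbox = page.locator('[data-test-default-domain-checkbox]'); // placeholder selector
  if (!(await checkbox.isChecked())) {
    // The modal can drop the selection on close/reopen; re-click so the publish
    // button isn't left disabled and the publish click a silent no-op.
    await checkbox.click();
  }
  let publishButton = page.locator('[data-test-publish-button]'); // placeholder selector
  console.log('publish button disabled at click time:', await publishButton.isDisabled());
  await publishButton.click();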
…index (CS-11043)

Test instrumentation (commit 62b3ee9) pinpointed the actual
broken link: even after disposeAffinity tears down the prerender
server's puppeteer pages and the publish handler DELETEs the
DB-level modules cache, the realm-server's OWN per-Realm
#sourceCache still holds the pre-swap bytes. When the immediately-
enqueued reindex's worker fetches modules via HTTP to this
realm-server, getSourceOrRedirect returns the cached old bytes;
the reindex renders against stale source; the rendered HTML
lands in boxel_index.isolated_html; the published URL serves
old content forever.

The Phase-3-PR-2 publish flow relied on the NodeAdapter file
watcher to invalidate the realm's caches via change events, but
that's an async race against the immediately-enqueued reindex.
The matrix republish test's direct-HTTP diagnostic confirmed
the symptom: status=202 publish response, but the published URL
served `contains-initial=true contains-updated=false`.

Fix layered with the others:
  - clearLocalCaches() — new public method on Realm. Bulk-
    invalidates #sourceCache and the module cache. Different
    from __testOnlyClearCaches in that it leaves the
    test-only transpile counter alone.
  - handle-publish-realm — between upsertPublishedRealmInRegistry
    and enqueueReindexRealmJob, lookupOrMount the realm and call
    clearLocalCaches on it. For a republish (the bug case), this
    nukes the stale source bytes the reindex would otherwise
    pull. For a new publish, the mount is fresh and the call is
    a no-op.

Combined with the earlier commits this closes the full chain:
  - Cache-Control: no-store on source responses (commit 5e3a2b3) →
    Chromium HTTP cache evicted.
  - disposeAffinity on publish (9c2af8a + ed53ad1) →
    prerender server puppeteer-page host-Loader caches evicted.
  - clearLocalCaches on publish (this commit) → realm-server
    per-Realm #sourceCache + module cache evicted.

Each addresses a distinct layer; together they ensure the next
reindex after a republish renders against the bytes that are
actually on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The [CS-11043 ...ms] checkpoint logging + the direct-HTTP fetch
comparison + the pre-assertion sentinel-text log were all added to
diagnose where in the publish-republish chain things were
breaking. They worked — they pinpointed the realm-server
#sourceCache as the actual broken link, which the
clearLocalCaches commit then fixed. With the bug closed, those
console.logs are pure noise on every successful run and clutter
the test's intent.

Removed:
  - the step() helper definition
  - all [CS-11043 …ms] checkpoint calls
  - the direct request.get(publishedRealmURL) call before opening
    the popup (was to discriminate server-side staleness from
    browser-cache; both layers are addressed now)
  - the pre-assertion sentinelLocator.count()/textContent() log
    (standard Playwright assertion failures already show the
    expected/received values)

Kept (structural improvements that make the test more robust, not
just diagnostic):
  - the source-content read-back guard via request.get + expect:
    if postCardSource silently fails, the published-URL assertion
    below would otherwise fail with a misleading message
  - the domain-checkbox isChecked() guard + conditional click:
    defends against the modal losing its checkbox selection on
    close/reopen, which would make the publish click a silent
    no-op
  - the .catch(() => null) on waitForResponse + downstream
    if-guard: lets the test fall through to the URL assertion if
    the response wait is transiently lost, rather than failing on
    infrastructure noise

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
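The kept waitForResponse guard, roughly (the URL predicate is illustrative; the 202 status comes from the diagnostic output quoted in an earlier commit):

  let publishResponse = await page
    .waitForResponse((response) => response.url().includes('_publish-realm'))
    .catch(() => null); // a transiently-lost wait falls through instead of failing
  if (publishResponse) {
    expect(publishResponse.status()).toBe(202);
  }
  // Either way, continue to the published-URL content assertion below.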
@backspace backspace marked this pull request as ready for review May 14, 2026 00:24
@backspace backspace requested review from habdelra and lukemelia May 14, 2026 01:44
@habdelra
Contributor

are there tests that we can add that demonstrate these new capabilities--like the prerenderer's new dispose affinity endpoint?

Comment on lines +559 to +572
if (prerenderer?.disposeAffinity) {
  try {
    await prerenderer.disposeAffinity({
      affinityType: 'realm',
      affinityValue: publishedRealmURL,
    });
  } catch (e) {
    log.warn(
      `disposeAffinity failed for ${publishedRealmURL}: ${
        e instanceof Error ? e.message : String(e)
      } — continuing with publish; stale Loader cache may persist until LRU rotation`,
    );
  }
}
Contributor

@habdelra habdelra May 14, 2026

this seems like it might be overkill. there is a clearCache: true option on the /render route for the prerenderer that was created specifically for this purpose: to clear the loader cache for the affinity. this is how we can use the same tab to handle both code changes and instance changes and always use the most recent code. When we do this we destroy the browser context for the realm in the prerenderer, which is basically like throwing out the baby with the bath water. I'd be more interested to know why this option is not working for you

Contributor

@habdelra habdelra May 14, 2026

also, hopefully the etag should be working for the module updates. If the etag wasn't working, I could see how disposing of the affinity would yield a working result, as that would destroy the browser context and thus force the browser to refetch the modules even if the etag didn't change.

Contributor Author

Thanks for looking at this. Claude agreed with your assessment; we went through several iterations of trying to diagnose the problem, and the disposeAffinity call did end up not being needed. I've deployed this branch since removing it and publishing still worked.

There are now realm server tests for the new behaviours, too. Can you look again?

backspace and others added 5 commits May 14, 2026 07:22
The dispose-affinity wiring was a symptom-treatment workaround:
disposing the puppeteer pages after a republish made the bug go away
by spawning fresh pages whose host-Loaders had no cached modules. But
the actual cause was upstream — the realm-server's per-Realm
#sourceCache was returning pre-swap bytes to whichever Loader fetched
modules after the swap, regardless of whether that Loader was fresh.

With Realm.clearLocalCaches() (commit ec5ddf9) now invalidating
#sourceCache before the reindex enqueues, the existing IndexRunner
clearCache:true-on-first-render mechanism is sufficient: the Loader
reset re-fetches, the realm-server's source cache is already empty,
and the response carries fresh bytes + fresh content-hash etag.

Removes:
  - handle-publish-realm: prerenderer.disposeAffinity(...) call and
    the prerenderer destructure from CreateRoutesArgs
  - runtime-common Prerenderer interface: optional disposeAffinity
  - prerender-app: POST /dispose-affinity endpoint
  - manager-app: POST /dispose-affinity broadcast
  - RemotePrerenderer: disposeAffinity() client

The internal Prerenderer.disposeAffinity / PagePool.disposeAffinity
methods stay — they're still used by LRU rotation, mid-render cancel,
and the existing prerendering tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three regression tests covering the load-bearing layers of the
publish-republish fix, complementing the matrix end-to-end test:

  - clearLocalCaches drops cached source bytes — warms the source
    cache via two source fetches (miss → hit), calls
    testRealm.clearLocalCaches(), asserts the next fetch is a miss.
    Directly exercises the public surface the publish handler now
    invokes after the FS swap.

  - source response sets Cache-Control: no-store — asserts the
    header on both miss-path and hit-path source responses. Documents
    the Chromium-cache contract the publish flow now depends on; if
    the header is ever dropped from defaultHeaders, Chromium will
    reintroduce the heuristic-cache vector that gave nyuitp2026 a
    37-hour staleness window.

  - republishing reflects updated source content in boxel_index —
    writes a card with title sentinel-initial-<uuid>, publishes,
    waits for that title to appear in boxel_index.head_html;
    rewrites with sentinel-updated-<uuid>, republishes, waits for
    the new title and asserts no row still references the initial
    sentinel. This is the data-layer regression for CS-11043 — if
    clearLocalCaches() is regressed, the second waitUntil times out
    exactly as the production bug would have it. Faster than the
    matrix Playwright test and gives a clearer signal at the DB
    layer specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
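A rough QUnit-style shape for the first of these tests (fetchSource and countSourceCacheMisses are hypothetical helpers; the real test may observe cache misses differently):

  test('clearLocalCaches drops cached source bytes', async function (assert) {
    await fetchSource(testRealm, 'sample-card.gts'); // miss: warms #sourceCache
    await fetchSource(testRealm, 'sample-card.gts'); // hit: served from cache
    assert.strictEqual(countSourceCacheMisses(), 1); // hypothetical miss counter

    testRealm.clearLocalCaches();

    await fetchSource(testRealm, 'sample-card.gts'); // must go back to disk
    assert.strictEqual(countSourceCacheMisses(), 2);
  });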
backspace and others added 2 commits May 14, 2026 09:23
Two corrections from the first CI failure:

  - source response sets Cache-Control: no-store: the precondition
    "first fetch is a cache miss" failed because testRealm.write()
    triggers the indexer, which fetches the source server-side and
    warms #sourceCache before the client's first GET. Drop the warm
    entry via __testOnlyClearCaches() right after write so the
    assertion exercises the genuine miss-path defaultHeaders.

  - republishing reflects updated source content in boxel_index:
    the wait for the initial sentinel timed out because I set
    `attributes.title`, which CardDef doesn't have. Title lives on
    cardInfo.name (which feeds the computed cardTitle). Set
    attributes.cardInfo.name instead and assert against
    search_doc::text — substring-matching the jsonb-as-text means
    we don't have to encode the exact path (cardInfo.name vs the
    derived cardTitle), and search_doc is populated for every
    indexed instance regardless of head-render outcome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
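The substring wait described above, roughly (waitUntil and dbAdapter are stand-ins for whatever the realm-server tests actually use; table and column names come from the commit message):

  // Wait until some boxel_index row's search_doc contains the updated sentinel title.
  await waitUntil(async () => {
    let rows = await dbAdapter.execute(
      `SELECT 1 FROM boxel_index WHERE search_doc::text LIKE '%${updatedSentinel}%'`,
    );
    return rows.length > 0;
  });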
Commit 5e3a2b3 added `cache-control: no-store` to defaultHeaders
in `getSourceOrRedirect`, but serveLocalFile unconditionally sets
`cache-control: '${public|private}, max-age=0'` after spreading
defaultHeaders — the no-store value was clobbered every time. The
test added in the previous commit caught this: the assertion failed
because the actually-served value was `public, max-age=0`.

The reason this never mattered for production:

  - The 5e3a2b3 commit attributed the 37 h staleness to
    Chromium's heuristic HTTP cache, but Chromium only falls back
    to heuristic caching when there's no explicit Cache-Control —
    `public, max-age=0` was already on every source response. With
    max-age=0, every fetch revalidates via etag.

  - The etag is content-hash based (etagBase: cached.contentHash).
    The CS-11043 production failure was that the realm-server's
    own #sourceCache returned stale bytes AND a stale content-hash
    etag together — so revalidation returned 304 for a hash that
    didn't match disk. The clearLocalCaches() fix (ec5ddf9)
    invalidates that cache before the reindex enqueues, breaking
    the chain at its actual cause.

So 5e3a2b3 was based on incorrect analysis (Chromium heuristic
caching) and didn't even take effect at runtime. Removing it makes
the PR a single load-bearing fix (clearLocalCaches) with one test
that exercises it (the republish-into-boxel_index regression) plus
the unit test for clearLocalCaches itself. No dead code.

Also drops the cache-control test from the previous commit since
the contract it was asserting on isn't (and now never was) real.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
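The clobbering, in simplified form (serveLocalFile's real signature and surroundings differ; this only shows the ordering problem the test caught):

  // getSourceOrRedirect supplied the header in defaultHeaders...
  let defaultHeaders = { 'cache-control': 'no-store' /* , etag, content-type, ... */ };

  // ...but serveLocalFile spread them and then set its own value, so every
  // source response actually went out with `public, max-age=0`.
  let headers = {
    ...defaultHeaders,
    'cache-control': `${isPublicReadable ? 'public' : 'private'}, max-age=0`,
  };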
Comment on lines +1342 to +1358
// CS-11043. Bulk-invalidate this realm's in-process byte caches.
// Called by the publish-realm handler after the FS swap, BEFORE the
// reindex enqueues — so that subsequent source reads (which the
// reindex's prerender fans out across many of) bypass any
// pre-swap bytes the realm still has in `#sourceCache` /
// `#moduleCache`. The Phase-3-PR-2 publish flow relies on the
// NodeAdapter file-watcher to pick up the swap, but that's an
// async-event race against the immediately-enqueued reindex; this
// method makes the invalidation synchronous from the publish
// handler's vantage point. Different from `__testOnlyClearCaches`
// in that it does NOT reset the transpile counter (which is
// test-only diagnostic state, unrelated to byte-correctness).
clearLocalCaches(): void {
  this.#sourceCache.clear();
  this.#dropAllModuleCacheEntries();
}

Contributor

You might want to coordinate with @lukemelia here. This touches on the work to move caching out of memory in the realm server in preparation for horizontally scaled realms. If we had 2 realm servers that needed cache clearing, I'm not sure how you would coordinate this.

Contributor Author

Thanks, we talked it over and came up with an approach. I'm going to merge this and implement a multi-realm-server fix in CS-11153.

@backspace backspace merged commit fd727d1 into main May 14, 2026
67 of 68 checks passed
